7 June 2017
All materials are on https://github.com/kevinwang09/SUMS_2017_R_workshop.
Go and download this file (https://github.com/kevinwang09/SUMS_2017_R_workshop/raw/master/dirty_data.xlsx)
Familiar yourself with the iris dataset. Typing iris into R console should load this data. Pay attention to its column, row names and structure of each column.
Please run these codes on your laptop prior to the lecture.
install.packages("tidyverse") ## Might be a while...
install.packages("janitor")
install.packages("plotly")
Statisticians are great at many things:
But the mother of all these, i.e. preparing data is not trivial. (e.g. STAT2xxx lab exams)
Powerful tools to organise your data so you can focus the statistics.
Passive learning is not going to work.
readr and readxljanitormagrittrdplyrggplot2R: tidyversebase R functions haven't adapted to the needs of modern users.
readr functions are superior in data import warnings, column type handling, speed, scalability and consistency.
library(readr)
dirtyIris = read_csv("dirtyIris.csv")
class(dirtyIris)
[1] "tbl_df" "tbl" "data.frame"
dirtyIris
# A tibble: 250 x 6
SepAl....LeNgth `Sepal.? Width` `petal.Length(*&^` `petal.$#^&Width`
<dbl> <dbl> <dbl> <dbl>
1 5.80000000 2.6 4.0 1.2
2 5.80000000 2.7 5.1 1.9
3 0.01874617 NA NA NA
4 NA NA NA NA
5 NA NA NA NA
# ... with 245 more rows, and 2 more variables: `SPECIES^` <chr>,
# allEmpty <chr>
tibble is a more advanced version of data.frame.
readxl and haven (for SAS, SPSS etc.) packages work similarly.
We now proceed to data cleaning on the dirtyIris dataset.
clean data as a data set that allows you to do statistical modelling without extra processing
first name vs First Name)cricket_australia vs cricket.australia)character or Date, not a factor ordered alphabetically (default in R).Clean data is a well-designed data.frame.
base R functions might struggle with all issues we mentioned.
Basic data cleaning using janitor package.
More advanced data manipulation through dplyr.
library(janitor) dirtyIris
# A tibble: 250 x 6
SepAl....LeNgth `Sepal.? Width` `petal.Length(*&^` `petal.$#^&Width`
<dbl> <dbl> <dbl> <dbl>
1 5.80000000 2.6 4.0 1.2
2 5.80000000 2.7 5.1 1.9
3 0.01874617 NA NA NA
4 NA NA NA NA
5 NA NA NA NA
# ... with 245 more rows, and 2 more variables: `SPECIES^` <chr>,
# allEmpty <chr>
## Clean up column names better = clean_names(dirtyIris) better
# A tibble: 250 x 6
sepal_length sepal_width petal_length petal_width species allempty
<dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 5.80000000 2.6 4.0 1.2 versicolor <NA>
2 5.80000000 2.7 5.1 1.9 virginica <NA>
3 0.01874617 NA NA NA <NA> <NA>
4 NA NA NA NA <NA> <NA>
5 NA NA NA NA <NA> <NA>
# ... with 245 more rows
## Removing empty rows/columns evenBetter = remove_empty_cols(better) evenBetter = remove_empty_rows(evenBetter) evenBetter
# A tibble: 209 x 5
sepal_length sepal_width petal_length petal_width species
<dbl> <dbl> <dbl> <dbl> <chr>
1 5.80000000 2.6 4.0 1.2 versicolor
2 5.80000000 2.7 5.1 1.9 virginica
3 0.01874617 NA NA NA <NA>
4 6.50000000 3.0 5.5 1.8 virginica
5 6.30000000 2.9 5.6 1.8 virginica
# ... with 204 more rows
head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species 1 5.1 3.5 1.4 0.2 setosa 2 4.9 3.0 1.4 0.2 setosa 3 4.7 3.2 1.3 0.2 setosa 4 4.6 3.1 1.5 0.2 setosa 5 5.0 3.6 1.4 0.2 setosa 6 5.4 3.9 1.7 0.4 setosa
## Similar to na.omit(), any rows with NA will be removed. evenBetterBetter = remove_missing(evenBetter) glimpse(evenBetterBetter)
Observations: 150 Variables: 5 $ sepal_length <dbl> 5.8, 5.8, 6.5, 6.3, 5.1, 6.2, 5.9, 6.8, 5.0, 6.7,... $ sepal_width <dbl> 2.6, 2.7, 3.0, 2.9, 3.8, 2.8, 3.0, 3.2, 3.0, 3.0,... $ petal_length <dbl> 4.0, 5.1, 5.5, 5.6, 1.9, 4.8, 4.2, 5.9, 1.6, 5.0,... $ petal_width <dbl> 1.2, 1.9, 1.8, 1.8, 0.4, 1.8, 1.5, 2.3, 0.2, 1.7,... $ species <chr> "versicolor", "virginica", "virginica", "virginic...
cleanIris = evenBetterBetter
This is the original iris data:
glimpse(iris)
Observations: 150 Variables: 5 $ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9,... $ Sepal.Width <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1,... $ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5,... $ Petal.Width <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1,... $ Species <fctr> setosa, setosa, setosa, setosa, setosa, setosa, ...
mean(cleanIris$sepal_length)
[1] 5.843333
plot(density(cleanIris$sepal_length), col = "red", lwd = 2)
We introduce a new notation: " x %>% f " means "f(x)". We call this operation as "x pipe f".
Reading codes from left to right. And compounded operations are possible.
Keyboard shortcut is Cmd+shift+M.
cleanIris$sepal_length %>% mean
[1] 5.843333
cleanIris$sepal_length %>% density %>% plot(col = "red", lwd = 2)
sepal_length and sepal_width in the cleanIris data:## Assuming you know the position of column names.
## But what if you resample your data?
cleanIris[1:2, c(1, 2)]
## Assuming you know the position of column names.
## Also assuming the first two columns satisfy certain properties.
cleanIris[1:2, c(T, T, F, F, F)]
## Much better!
## What if you can't type out all the column names
## due to the size of your data?
cleanIris[1:2, c("sepal_length", "sepal_width")]
cleanIris[(cleanIris[,"sepal_length"] < 5) &
(cleanIris[,"sepal_width"] < 3), c("petal_length", "sepal_length")]
# A tibble: 4 x 2
petal_length sepal_length
<dbl> <dbl>
1 1.3 4.5
2 1.4 4.4
3 3.3 4.9
4 4.5 4.9
R user might know about the subset function, but it suffers the same problem of not able to have multiple subsetting criteria without predefined variables.subset(cleanIris,
subset = (sepal_length < 5) & (sepal_width < 3),
select = grep("length", colnames(cleanIris), value = TRUE))
# A tibble: 4 x 2
sepal_length petal_length
<dbl> <dbl>
1 4.5 1.3
2 4.4 1.4
3 4.9 3.3
4 4.9 4.5
select columns are operations on variables.filter rows are operations on observations.
See cheatsheet.
library(dplyr)
cleanIris %>%
filter(sepal_length < 5, sepal_width < 3) %>%
select(contains("length"))
# A tibble: 4 x 2
sepal_length petal_length
<dbl> <dbl>
1 4.5 1.3
2 4.4 1.4
3 4.9 3.3
4 4.9 4.5
mutate creates new columnsiris_mutated = mutate(cleanIris,
newVar = sepal_length + petal_length,
newNewVar = newVar*2) %>%
rename(lengthSum = newVar,
lengthSumSq = newNewVar)
iris_mutated
# A tibble: 150 x 7
sepal_length sepal_width petal_length petal_width species lengthSum
<dbl> <dbl> <dbl> <dbl> <chr> <dbl>
1 5.8 2.6 4.0 1.2 versicolor 9.8
2 5.8 2.7 5.1 1.9 virginica 10.9
3 6.5 3.0 5.5 1.8 virginica 12.0
4 6.3 2.9 5.6 1.8 virginica 11.9
5 5.1 3.8 1.9 0.4 setosa 7.0
# ... with 145 more rows, and 1 more variables: lengthSumSq <dbl>
group_by + summarise will create summary statistics for grouped variablesbySpecies = cleanIris %>% group_by(species) bySpecies
# A tibble: 150 x 5
# Groups: species [3]
sepal_length sepal_width petal_length petal_width species
<dbl> <dbl> <dbl> <dbl> <chr>
1 5.8 2.6 4.0 1.2 versicolor
2 5.8 2.7 5.1 1.9 virginica
3 6.5 3.0 5.5 1.8 virginica
4 6.3 2.9 5.6 1.8 virginica
5 5.1 3.8 1.9 0.4 setosa
# ... with 145 more rows
bySpecies %>% summarise(meanSepalLength = mean(sepal_length))
# A tibble: 3 x 2
species meanSepalLength
<chr> <dbl>
1 setosa 5.006
2 versicolor 5.936
3 virginica 6.588
selectbySpecies %>%
summarise_if(is.numeric,
funs(m = mean))
# A tibble: 3 x 5
species sepal_length_m sepal_width_m petal_length_m petal_width_m
<chr> <dbl> <dbl> <dbl> <dbl>
1 setosa 5.006 3.428 1.462 0.246
2 versicolor 5.936 2.770 4.260 1.326
3 virginica 6.588 2.974 5.552 2.026
left_join for merging dataflowers = data.frame(species = c("setosa", "versicolor", "virginica"),
comments = c("meh", "kinda_okay", "love_it!"))
## cleanIris has the priority in this join operation
iris_comments = left_join(cleanIris, flowers, by = "species")
## Randomly sampling 6 rows
sample_n(iris_comments, 6)
# A tibble: 6 x 6
sepal_length sepal_width petal_length petal_width species comments
<dbl> <dbl> <dbl> <dbl> <chr> <fctr>
1 6.0 3.4 4.5 1.6 versicolor kinda_okay
2 5.6 3.0 4.5 1.5 versicolor kinda_okay
3 5.8 2.7 5.1 1.9 virginica love_it!
4 4.8 3.0 1.4 0.1 setosa meh
5 5.1 3.5 1.4 0.2 setosa meh
6 5.7 2.9 4.2 1.3 versicolor kinda_okay
arrange for ordering rowsarrangeCleanIris = cleanIris %>% arrange(sepal_length, sepal_width, petal_length, petal_width) ## The original data arrangeIris = iris %>% clean_names() %>% arrange(sepal_length, sepal_width, petal_length, petal_width)
dirtyIris data and the arranged iris data. Are they the same as I claim to be?## The `Species` column is character or factor all.equal(arrangeCleanIris, arrangeIris)
[1] "Incompatible type for column `species`: x character, y factor"
arrangeIris = arrangeIris %>% mutate(species = as.character(species)) ## Great! all.equal(arrangeCleanIris, arrangeIris)
[1] TRUE
## identical is stronger version of all.equal ## arrangeCleanIris is a tibble ## but arrangeIris is a data.frame identical(arrangeCleanIris, arrangeIris)
[1] FALSE
class(arrangeCleanIris)
[1] "tbl_df" "tbl" "data.frame"
class(arrangeIris)
[1] "data.frame"
## Now, they are exactly the same! identical(arrangeCleanIris, arrangeIris %>% as.tibble)
[1] TRUE
ggplot2: best visualisation package EVERSee tutorial sheet. (https://gauss17gon.shinyapps.io/ggplot2_basic_tutorial/)
Di Cook - the real reason that you should use ggplot2 is because its design will force you to use a certain grammar when producing a plot. So much so, every plot you produce is actually a statistic.
Every characteristic (e.g. axes, colour, size) of your graph is a column in a data.frame. i.e. you can't plot vector x in line 1 and vector newx in line 100.
library(ggplot2)
p1 =
iris %>%
ggplot(aes(x = Sepal.Length,
y = Sepal.Width)) +
geom_point(size = 3)
p1
p2 =
iris %>%
ggplot(aes(x = Sepal.Length,
y = Sepal.Width,
colour = Petal.Length,
shape = Species)) +
geom_point(size = 3) +
scale_color_distiller(palette = "Spectral") +
theme_bw() +
theme(legend.position = "bottom")
p2
iris. i.e. The x-axis should have a \(4 \times 3\) grouping.data.frame is bad for this purpose.tidyr::gather to create two new column in place of the 4 numerical columns:
variable, which indicates (petal/sepal-width/length)value, the corresponding value for that variable for the Species## Initial plotting to decide which variable should be matched to which variable
gatherData = iris %>%
gather(key = variable,
value = value,
-Species) ## This is the column we don't want to throw away.
dim(gatherData)
[1] 600 3
head(gatherData)
Species variable value 1 setosa Sepal.Length 5.1 2 setosa Sepal.Length 4.9 3 setosa Sepal.Length 4.7 4 setosa Sepal.Length 4.6 5 setosa Sepal.Length 5.0 6 setosa Sepal.Length 5.4
gatherData %>%
ggplot(aes(x = variable,
y = value,
fill = Species)) +
geom_boxplot() +
theme_minimal()
tidyverse (optional)tidyverse: tidy data + tidy codingtidyverse is a collection of 20 packages (still expanding!) built on the philosophy that both your data and coding should be tidy.
plotly::ggplotly(p2)
What I discussed today is really about being "tidy" and having a reproducible pipeline. The same principle applies when you are doing any projects.
Use RStudio + RMarkdown to document your codes. Controversial: plain R isn't flexible enough to handle large projects and documentations.
You don't have to adopt everything I taught you today! Find the component that you like to use and adapt that into your work routine. (Hint: start with cheatsheets and build up gradually.)
Take time to re-analyse an old dataset.
The design/implementation philosophy is not for everyone. e.g. tibble does not have row names, which makes sense if you data is large but less so if your data is small.
Learn the core functions in every package first! All modern package has a tutorial or vignette page.
Don't forget the theories and interpretations! This is a course about statistics after all, not Cranking-Out-Numbers-Less-Than-0.05-And-Reject-Null-Hypothesis-101.
sessionInfo()
R version 3.4.1 (2017-06-30) Platform: x86_64-apple-darwin15.6.0 (64-bit) Running under: macOS Sierra 10.12.3 Matrix products: default BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib locale: [1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8 attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] bindrcpp_0.2 janitor_0.3.0 dplyr_0.7.2 purrr_0.2.2.2 [5] readr_1.1.1 tidyr_0.6.3 tibble_1.3.3 ggplot2_2.2.1 [9] tidyverse_1.1.1 knitr_1.16 loaded via a namespace (and not attached): [1] reshape2_1.4.2 haven_1.1.0 lattice_0.20-35 [4] colorspace_1.3-2 htmltools_0.3.6 viridisLite_0.2.0 [7] yaml_2.1.14 plotly_4.7.0 rlang_0.1.1 [10] foreign_0.8-69 glue_1.1.1 RColorBrewer_1.1-2 [13] modelr_0.1.0 readxl_1.0.0 bindr_0.1 [16] plyr_1.8.4 stringr_1.2.0 munsell_0.4.3 [19] gtable_0.2.0 cellranger_1.1.0 rvest_0.3.2 [22] htmlwidgets_0.9 psych_1.7.5 evaluate_0.10.1 [25] labeling_0.3 forcats_0.2.0 httpuv_1.3.5 [28] crosstalk_1.0.0 parallel_3.4.1 broom_0.4.2 [31] Rcpp_0.12.12 xtable_1.8-2 scales_0.4.1.9002 [34] backports_1.1.0 jsonlite_1.5 mime_0.5 [37] mnormt_1.5-5 hms_0.3 digest_0.6.12 [40] stringi_1.1.5 shiny_1.0.3 grid_3.4.1 [43] rprojroot_1.2 tools_3.4.1 magrittr_1.5 [46] lazyeval_0.2.0 pkgconfig_2.0.1 data.table_1.10.4 [49] xml2_1.1.1 lubridate_1.6.0 assertthat_0.2.0 [52] rmarkdown_1.6 httr_1.2.1 R6_2.2.2 [55] nlme_3.1-131 compiler_3.4.1